Abstract

Granular association rule mining is a new relational data mining approach to
reveal patterns hidden in multiple tables. Existing research on granular
association rule mining considers only nominal data. In this paper, we study the
impact of discretization approaches on mining semantically richer and stronger
rules from numeric data. Specifically, the Equal Width approach and the Equal
Frequency approach are adopted and compared. The setting of interval numbers
is a key issue in discretization approaches, so we compare different settings
through experiments on a well-known real-life data set. Experimental results
show that: 1) discretization is an effective preprocessing technique for mining
stronger rules; 2) the Equal Frequency approach helps generate more rules
than the Equal Width approach; 3) some settings of the interval numbers yield
many more rules than others.

Keywords: Granular association rule, discretization, Equal Width, Equal 
Frequency, relational data mining. 



1. Introduction 

Relational data mining schemes [7] [5] look for patterns that involve multiple
tables in the database. Some meaningful issues are undisputedly more common
and more challenging than their counterparts on a single data table. Recently,
attention has focused on association rules and on computing with granules.

Granular association rule mining [16, 17] is a new approach to reveal patterns
hidden in multiple tables. This approach generates rules with four measures to
reveal connections between concepts in two universes. We consider a database
with two entities, customer and product, connected by a relation buys. An
example of a granular association rule might be "40% of men like at least 30%
of the kinds of alcohol; 45% of customers are men and 6% of products are
alcohol." Here 45%, 6%, 40% and 30% are the source coverage, the target
coverage, the source confidence and the target confidence, respectively.
Numeric data are very common in real-world problems. Unfortunately, only
nominal data are considered in the original definition of granular association
rules [16, 17].

* Corresponding author. Tel.: +86 133 7690 8359
Email addresses: hexu_grclab@163.com (Xu He), minfanphd@163.com (Fan Min),
williamfengzhu@gmail.com (William Zhu)

Preprint submitted to Elsevier, December 13, 2012

We employ two discretization approaches, the Equal Width approach and the
Equal Frequency approach, to preprocess the numeric data. The Equal Width
approach finds the minimum and maximum of the numeric data and divides the
range into k equal-width intervals. The Equal Frequency approach sorts the
values in ascending order and divides them into k intervals, each containing
the same number of values. By comparing the two approaches in terms of the
numbers of generated rules and candidates, we can determine which one is
better suited to granular association rule mining.

Experiments are undertaken on the publicly available MovieLens data set.
We introduce two parameters $k_1$ and $k_2$: $k_1$ is the number of intervals
for the age of the user, and $k_2$ is the number of intervals for the release
year of the movie. The discretization approaches are implemented in Java in our
open source software COSER (Cost sensitive rough set) [22].

Our experimental results show that discretization is an effective preprocessing
technique for mining stronger rules. The Equal Frequency and Equal Width
approaches are both simple ways to discretize data, yet they achieve good
results. Given thresholds for the four measures, the Equal Frequency approach
generates more rules than the Equal Width approach. For any pair of integers
$(k_1, k_2)$ we obtain a set of rules; by comparing the sizes of these sets, we
identify favorable settings of the interval numbers for each approach. With
$k_1$ ranging from 8 to 10 and $k_2$ ranging from 10 to 12 under the Equal
Frequency approach, we obtain many more rules than with other settings.

The remainder of the paper is organized as follows. Section 2 reviews granular
association rules. Section 3 presents granular association rules on numeric
data, from which semantically richer and stronger rules may be mined. In
Section 4 we describe each discretization approach and discuss its suitability
for granular association rule mining. Experiments on the MovieLens data set [1]
are discussed in Section 5. Finally, Section 6 presents the concluding remarks
and further research directions.

2. Granular association rule 

In this section, we revisit granular association rules [17]. We analyse the
definition and the four measures of such rules. Moreover, we introduce the
basic design of granular association rule mining.

2.1. The data model 

First of all, we introduce the data model, which is built on information
systems and binary relations.

Definition 1. $S = (U, A)$ is an information system, where
$U = \{x_1, x_2, \dots, x_n\}$ is the set of all objects,
$A = \{a_1, a_2, \dots, a_m\}$ is the set of all attributes, and $a_j(x_i)$ is
the value of $x_i$ on attribute $a_j$ for $i \in [1..n]$ and $j \in [1..m]$.

In an information system, any $A' \subseteq A$ induces an equivalence relation
[23, 25]

$$E_{A'} = \{(x, y) \in U \times U \mid \forall a \in A',\ a(x) = a(y)\}, \tag{1}$$

and partitions $U$ into a number of disjoint subsets called blocks. The block
containing $x \in U$ is

$$E_{A'}(x) = \{y \in U \mid \forall a \in A',\ a(y) = a(x)\}. \tag{2}$$

From another viewpoint, a pair $C = (A', x)$ where $x \in U$ and
$A' \subseteq A$ is called a concept. The extension of the concept is

$$ET(C) = ET(A', x) = E_{A'}(x); \tag{3}$$

while the intension of the concept is the conjunction of respective
attribute-value pairs, i.e.,

$$IT(C) = IT(A', x) = \bigwedge_{a \in A'} \langle a : a(x) \rangle. \tag{4}$$

The support of the concept is the size of its extension divided by the size of
the universe, namely,

$$support(C) = support(A', x) = support\Big(\bigwedge_{a \in A'} \langle a : a(x) \rangle\Big) = support(E_{A'}(x)) = \frac{|E_{A'}(x)|}{|U|}. \tag{5}$$
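Equations (2) and (5) can be illustrated with a minimal Python sketch. The function names `blocks` and `support` and the four-object table are assumptions made for illustration only; they are not taken from the paper or from COSER.

```python
from collections import defaultdict


def blocks(table, attrs):
    """Partition the objects of an information system by their values on attrs.

    table maps an object id to a dict of attribute values; the result maps each
    value tuple to the block (set of object ids) sharing it, as in Equation (2).
    """
    groups = defaultdict(set)
    for obj, row in table.items():
        groups[tuple(row[a] for a in attrs)].add(obj)
    return dict(groups)


def support(table, attrs, x):
    """Support of the concept (attrs, x): |E_{A'}(x)| / |U|, as in Equation (5)."""
    key = tuple(table[x][a] for a in attrs)
    return len(blocks(table, attrs)[key]) / len(table)


# Hypothetical information system with four objects (illustrative values only).
customers = {
    "c1": {"Gender": "Male", "Married": "No"},
    "c2": {"Gender": "Female", "Married": "Yes"},
    "c3": {"Gender": "Male", "Married": "No"},
    "c4": {"Gender": "Male", "Married": "Yes"},
}

print(blocks(customers, ["Gender"]))
print(support(customers, ["Gender"], "c1"))
print(support(customers, ["Gender", "Married"], "c1"))
```

Adding an attribute to `attrs` can only split blocks further, so the support of a concept never increases as its intension grows.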

Definition 2. Let $U = \{x_1, x_2, \dots, x_n\}$ and
$V = \{y_1, y_2, \dots, y_k\}$ be two sets of objects. Any
$R \subseteq U \times V$ is a binary relation from $U$ to $V$. The neighborhood
of $x \in U$ is

$$R(x) = \{y \in V \mid (x, y) \in R\}. \tag{6}$$

If $U = V$ and $R$ is an equivalence relation, $R(x)$ is the equivalence class
containing $x$. From this definition we know immediately that for $y \in V$,

$$R^{-1}(y) = \{x \in U \mid (x, y) \in R\}. \tag{7}$$

A binary relation is usually stored in the database as a table with two
foreign keys, which saves storage. For convenience of illustration, we
represent it here as an $n \times k$ Boolean matrix.
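The two storage forms and the neighborhoods of Equations (6) and (7) can be sketched as follows. The helper names `neighborhoods` and `to_boolean_matrix` and the three-pair relation are hypothetical and serve only as an illustration.

```python
def neighborhoods(pairs):
    """Build R(x) and R^{-1}(y) from a list of (x, y) foreign-key pairs."""
    forward, backward = {}, {}
    for x, y in pairs:
        forward.setdefault(x, set()).add(y)   # R(x), Equation (6)
        backward.setdefault(y, set()).add(x)  # R^{-1}(y), Equation (7)
    return forward, backward


def to_boolean_matrix(pairs, rows, cols):
    """Expand the pair list into an n x k Boolean matrix (rows x cols)."""
    present = set(pairs)
    return [[(x, y) in present for y in cols] for x in rows]


# Hypothetical relation stored as a two-foreign-key table.
buys = [("c1", "p1"), ("c1", "p2"), ("c2", "p2")]

forward, backward = neighborhoods(buys)
matrix = to_boolean_matrix(buys, ["c1", "c2"], ["p1", "p2"])
print(forward, backward, matrix)
```

The pair list grows with the number of connections, while the matrix always needs n x k cells, which is why the pair form is the usual choice for storage.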

With Definitions 1 and 2, we propose the following definition.

Definition 3. [16] A many-to-many entity-relationship system (MMER) is a
5-tuple $ES = (U, A, V, B, R)$, where $(U, A)$ and $(V, B)$ are two information
systems, and $R \subseteq U \times V$ is a binary relation from $U$ to $V$.

2.2. Granular association rule with four measures

Now we come to the central definition of granular association rules. 

Definition 4. [16] A granular association rule is an implication of the form

$$(GR): \bigwedge_{a \in A'} \langle a : a(x) \rangle \Rightarrow \bigwedge_{b \in B'} \langle b : b(y) \rangle, \tag{8}$$

where $A' \subseteq A$ and $B' \subseteq B$.

According to Equation (5), the set of objects meeting the left-hand side of
the granular association rule is

$$LH(GR) = E_{A'}(x); \tag{9}$$

while the set of objects meeting the right-hand side of the granular
association rule is

$$RH(GR) = E_{B'}(y). \tag{10}$$

The source coverage of a granular association rule is

$$scoverage(GR) = |LH(GR)| / |U|. \tag{11}$$

The target coverage of $GR$ is

$$tcoverage(GR) = |RH(GR)| / |V|. \tag{12}$$

There is a tradeoff between the source confidence and the target confidence
of a rule. Consequently, neither value can be obtained directly from the rule.
To compute either of them, we should specify the threshold of the other. Let
$tc$ be the target confidence threshold. The source confidence of the rule is

$$sconfidence(GR, tc) = \frac{\left|\left\{x \in LH(GR) \,\middle|\, \frac{|R(x) \cap RH(GR)|}{|RH(GR)|} \ge tc\right\}\right|}{|LH(GR)|}. \tag{13}$$

Let $mc$ be the source confidence threshold, and

$$|\{x \in LH(GR) \mid |R(x) \cap RH(GR)| \ge K + 1\}| < mc \times |LH(GR)| \le |\{x \in LH(GR) \mid |R(x) \cap RH(GR)| \ge K\}|. \tag{14}$$

This equation means that $mc \times 100\%$ of the elements in $LH(GR)$ have
connections with at least $K$ elements in $RH(GR)$, but fewer than
$mc \times 100\%$ of the elements in $LH(GR)$ have connections with at least
$K + 1$ elements in $RH(GR)$. The target confidence of the rule is

$$tconfidence(GR, mc) = K / |RH(GR)|. \tag{15}$$

In fact, the computation of $K$ is non-trivial. First, for any $x \in LH(GR)$,
we need to compute $tc(x) = |R(x) \cap RH(GR)|$ and obtain an array of integers.
Second, we sort the array in descending order. Third, let
$k = \lceil mc \times |LH(GR)| \rceil$; $K$ is the $k$-th element in the array.
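The three steps for computing $K$, together with Equations (13) and (15), can be sketched in Python as below. The function names and the small relation are hypothetical, the sketch assumes $0 < mc \le 1$ and non-empty $LH(GR)$ and $RH(GR)$, and it is not the COSER implementation. Thresholds are plain floats here, so a product such as `mc * len(lh)` may suffer from rounding; exact rational arithmetic would avoid that.

```python
import math


def source_confidence(R, lh, rh, tc):
    """Equation (13): share of x in LH whose |R(x) ∩ RH| / |RH| reaches tc."""
    hits = sum(1 for x in lh if len(R.get(x, set()) & rh) / len(rh) >= tc)
    return hits / len(lh)


def target_confidence(R, lh, rh, mc):
    """Equation (15): K / |RH|, with K found by the three steps in the text."""
    # Step 1: tc(x) = |R(x) ∩ RH| for every x in LH.
    counts = [len(R.get(x, set()) & rh) for x in lh]
    # Step 2: sort the array in descending order.
    counts.sort(reverse=True)
    # Step 3: K is the k-th element (1-based), with k = ceil(mc * |LH|).
    k = max(1, math.ceil(mc * len(lh)))
    return counts[k - 1] / len(rh)


# Hypothetical relation; not taken from the tables of this paper.
R = {"x1": {"y1", "y2", "y3"}, "x2": {"y1"}, "x3": {"y1", "y2"}, "x4": set()}
lh = {"x1", "x2", "x3", "x4"}
rh = {"y1", "y2", "y3", "y4"}

print(source_confidence(R, lh, rh, tc=0.5))
print(target_confidence(R, lh, rh, mc=0.5))
```

With `mc = 0.5` the sorted counts are 3, 2, 1, 0 and $k = 2$, so $K = 2$: two of the four source objects reach at least two targets, while only one reaches three, which is exactly the condition in Equation (14).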
2.3. Granular association rule mining

The basic design of granular association rule mining is as follows. 

Definition 5. The granular association rule mining problem.

Input: An $ES = (U, A, V, B, R)$, a minimal source coverage threshold $ms$,
a minimal target coverage threshold $mt$, a minimal source confidence threshold
$mc$, and a minimal target confidence threshold $tc$.

Output: All granular association rules satisfying $scoverage(GR) \ge ms$,
$tcoverage(GR) \ge mt$, $sconfidence(GR) \ge mc$, and $tconfidence(GR) \ge tc$.

3. Granular association rule on numeric data 

There are many different types of data to describe objects. So far, all data
have implicitly been treated as nominal. However, in real-world applications,
a very large proportion of data sets involve numeric data. One scheme to solve
this problem is to divide numeric data into a number of intervals and regard
each interval as a category. This process is usually called discretization.
Our main aim is to mine, through discretization, semantically richer and
stronger rules that cannot be mined from the primary data. For instance,
Table 1 gives an information system where $U$ = {c1, c2, c3, c4, c5, c6, c7,
c8, c9, c10} and $A$ = {Age, Gender, Married, Salary}. Among them, Age and
Salary are numeric. Another example is given in Table 2, where $V$ = {p1, p2,
p3, p4, p5, p6, p7, p8} and $B$ = {Country, Category, Color, Price}. Among
them, Price is numeric.

As in Section 2.1, the binary relation is represented here as an $n \times k$
Boolean matrix rather than as a table with two foreign keys. An example is
given in Table 3, where $U$ is the set of customers in Table 1 and $V$ is the
set of products in Table 2.

We first identify all numeric attributes in the information systems, then
divide each of them into a number of intervals and regard each interval as a
category, as shown in Tables 4 and 5. From the MMER given by Tables 3, 4 and 5
we may obtain the following interesting rules.
(Rule 1) (Gender: Male) ⇒ (Category: Alcohol).
(Rule 2) (Age: [30, 35)) ∧ (Gender: Male) ⇒ (Category: Alcohol).
(Rule 3) (Married: Yes) ⇒ (Country: China).
(Rule 4) (Married: Yes) ∧ (Salary: [4700, 5600])
⇒ (Country: China) ∧ (Price: [2.0, 7.3)).

Rule 1 can be read as "men like alcohol." Rule 2 can be read as "men whose
age is between 30 and 35 like alcohol." Rule 3 can be read as "married people
like products made in China." Rule 4 can be read as "married people whose
salaries are between 4700 and 5600 like products made in China whose prices
are between 2.0 and 7.3."

From the above we conclude that discretization lets us mine semantically richer
and stronger rules that cannot be mined from the primary data, such as Rules 2
and 4. Given the same thresholds for the four measures, Rule 2 is semantically
richer than Rule 1, and Rule 4 is richer than Rule 3. A detailed explanation of
Rule 4 might be "60% of married people like at least 60% of the products whose
prices are between 2.0 and 7.3; 70% of customers are married people, and 62.5%
of all products have prices between 2.0 and 7.3."

Table 1: Customer

CID   Age   Gender   Married   Salary
c1    20    Male     No        2000
c2    25    Female   Yes       2800
c3    23    Male     No        3500
c4    26    Female   Yes       2400
c5    32    Male     Yes       5600
c6    36    Male     Yes       4200
c7    39    Male     Yes       5000
c8    40    Female   Yes       5000
c9    35    Female   Yes       3400
c10   34    Male     Yes       3600

Table 2: Product

PID   Country     Category   Color    Price
p1    China       Staple     Yellow   2.0
p2    Australia   Staple     Black    4.0
p3    China       Daily      White    5.5
p4    China       Meat       Red      8.0
p5    Australia   Meat       Red      18.0
p6    China       Alcohol    Yellow   3.0
p7    France      Alcohol    Yellow   5.0
p8    France      Alcohol    White    16.5

4. Discretization approaches 

In this section, we introduce the discretization approaches, which divide
numeric data into intervals and regard each interval as a category. Given
thresholds for the four measures, different discretizations yield different
rules. Since the number of intervals is a key issue in discretization
approaches, we try several settings of the interval number to find a suitable
one, and then mine the corresponding granular association rules.

In this paper, we adopt two discretization approaches, namely the Equal
Width approach and the Equal Frequency approach. Both are simple methods to
discretize data and have often been used to produce nominal data from numeric
data.

Table 3: Buys

CID\PID   p1   p2   p3   p4   p5   p6   p7   p8
c1        1    0    0    1    1    1    0    0
c2        1    0    0    1    0    1    0    0
c3        0    0    1    0    1    0    1    1
c4        0    0    0    1    1    1    0    0
c5        0    0    1    1    0    0    1    1
c6        0    0    0    1    0    0    1    0
c7        1    0    1    1    0    0    1    1
c8        0    0    1    0    1    1    1    0
c9        1    0    1    0    1    0    1    0
c10       1    0    1    0    1    0    1    1

Table 4: Discretization for Age and Salary

CID   Age       Gender   Married   Salary
c1    [20,25)   Male     No        [2000, 2900)
c2    [25,30)   Female   No        [2000, 2900)
c3    [20,25)   Male     No        [2900, 3800)
c4    [25,30)   Female   Yes       [2000, 2900)
c5    [30,35)   Male     Yes       [3800, 4700]
c6    [35,40]   Male     Yes       [2900, 3800)
c7    [35,40]   Male     Yes       [4700, 5600)
c8    [35,40]   Female   Yes       [4700, 5600)
c9    [35,40]   Female   Yes       [2900, 3800)
c10   [30,35)   Male     Yes       [2900, 3800)

4.1. The Equal Width approach

The Equal Width approach finds the minimal value $a_0$ and the maximal
value $a_k$ of the numeric data, and divides the range into $k$ equal-width
intervals. Here $k$ is a parameter supplied by the user. The approach
calculates the discretization width

$$\lambda = \frac{a_k - a_0}{k}. \tag{16}$$

These values form the boundary set $\{a_0, a_1, \dots, a_i, \dots, a_{k-1}, a_k\}$
for the intervals $\{[a_0, a_1), \dots, [a_{i-1}, a_i), \dots, [a_{k-1}, a_k]\}$,
where $a_i = a_0 + i\lambda$ for $i = 1, 2, \dots, k$. The approach is applied
to each numeric attribute independently. Finally, we obtain the discretized
data.
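A minimal Python sketch of Equation (16) and the resulting intervals is given below, applied to the Price column of Table 2 with $k = 3$. The function names `equal_width_edges` and `equal_width_bin` are illustrative assumptions and do not refer to the COSER code.

```python
def equal_width_edges(values, k):
    """Boundary set a_0, ..., a_k with a_i = a_0 + i * width (Equation (16))."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The last edge is set to the maximum itself to avoid rounding drift.
    return [lo + i * width for i in range(k)] + [hi]


def equal_width_bin(value, edges):
    """Index of the interval containing value; only the last one is closed."""
    k = len(edges) - 1
    for i in range(k - 1):
        if value < edges[i + 1]:
            return i
    return k - 1


# Price column of Table 2, in the order p1, ..., p8.
prices = [2.0, 4.0, 5.5, 8.0, 18.0, 3.0, 5.0, 16.5]

edges = equal_width_edges(prices, 3)
bins = [equal_width_bin(p, edges) for p in prices]
print([round(e, 1) for e in edges])
print(bins)
```

Rounded to one decimal, the edges are 2.0, 7.3, 12.7 and 18.0, which are the interval boundaries used for Price in Table 5.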
Table 5: Discretization for Price

PID   Country     Category   Color    Price
p1    China       Staple     Yellow   [2.0, 7.3)
p2    Australia   Staple     Black    [2.0, 7.3)
p3    China       Daily      White    [2.0, 7.3)
p4    China       Meat       Red      [7.3, 12.7)
p5    Australia   Meat       Red      [12.7, 18.0]
p6    China       Alcohol    Yellow   [2.0, 7.3)
p7    France      Alcohol    Yellow   [2.0, 7.3)
p8    France      Alcohol    White    [12.7, 18.0]



4.2. The Equal Frequency approach

The Equal Frequency approach finds the minimal value $b_0$ and the maximal
value $b_k$ of the numeric data, and sorts the values in ascending order. Here
$k$ is a parameter supplied by the user. The range is divided into $k$
intervals such that every interval contains the same number of sorted values.
These values form the boundary set $\{b_0, b_1, b_2, \dots, b_{k-1}, b_k\}$ for
the intervals $\{[b_0, b_1), [b_1, b_2), \dots, [b_{k-1}, b_k]\}$.
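One simple way to realize this idea is to rank the values and cut the ranking into $k$ equally sized groups, as in the Python sketch below. The function name `equal_frequency_bins` and the rank-based rule are assumptions for illustration; the text does not fix how boundaries are placed between neighboring values or how ties are handled, so the cut points of this sketch need not coincide with those reported in the paper.

```python
def equal_frequency_bins(values, k):
    """Assign each value an interval index so that intervals hold equal counts.

    Values are ranked in ascending order and the value of rank r (0-based)
    goes to interval r * k // n. When n is not divisible by k the counts
    differ by at most one. Equal values may end up in different intervals.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n
    return bins


# Age column of Table 1, in the order c1, ..., c10.
ages = [20, 25, 23, 26, 32, 36, 39, 40, 35, 34]

bins = equal_frequency_bins(ages, 5)
print(bins)
```

With ten ages and $k = 5$ every interval receives exactly two customers, whereas equal-width intervals on the same column would hold unequal numbers of customers.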

We set different interval numbers $k$ to divide the numeric data, and use
different discretization approaches to produce different intervals. In general,
the more intervals there are, the higher the confidence and the lower the
coverage of each interval. We compare the resulting intervals to find a
suitable setting for rule mining. For example, Table 2 shows that Price ranges
from 2.0 to 18.0. With $k = 3$, the price of p3 falls between 2.0 and 7.3 under
the Equal Width approach, and between 2.0 and 7.0 under the Equal Frequency
approach. With $k = 4$, the price of p3 falls between 2.0 and 6.0 under the
Equal Width approach, and between 2.0 and 5.5 under the Equal Frequency
approach. This comparison shows that choosing the interval number and the
discretization approach carefully is important for producing intervals
suitable for rule mining.

5. Experiments on a real world data set 

5.1. A movie rating data set 

The MovieLens data set [1] assembled by the GroupLens project is widely
used in recommender systems. We downloaded the data set from the Internet
Movie Database [1]. The data set contains 100,000 ratings (1-5) from 943 users
on 1,682 movies, with each user rating at least 20 movies [24]. In order to run
our algorithm, we preprocessed the data set as follows.

1. Remove movie names. They are not useful in generating meaningful
granular association rules.

2. Use release year instead of release date. In this way the granule is more
suitable.

3. Select the movie genre. In the original data, the movie genre is
multi-valued since one movie may fall in more than one genre. For example,
a movie can be both Animation and Children's. Unfortunately, granular
association rules do not support this type of data at this time. Since the
main objective of this work is to compare the performances of algorithms,
we use a simple approach to deal with this issue: we sort the movie genres
according to the number of users they attract, and keep only the genre with
the highest priority for each movie. We adopt the following priority (from
high to low): Comedy, Action, Thriller, Romance, Adventure, Children,
Crime, Sci-Fi, Horror, War, Mystery, Musical, Documentary, Animation,
Western, FilmNoir, Fantasy, Unknown.
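The genre selection in step 3 can be sketched in a few lines of Python. The priority order is the one listed above; the function name `primary_genre` and the sample inputs are illustrative assumptions.

```python
# Priority order from step 3, from high to low.
PRIORITY = [
    "Comedy", "Action", "Thriller", "Romance", "Adventure", "Children",
    "Crime", "Sci-Fi", "Horror", "War", "Mystery", "Musical", "Documentary",
    "Animation", "Western", "FilmNoir", "Fantasy", "Unknown",
]


def primary_genre(genres):
    """Keep only the genre that appears earliest in the priority list."""
    return min(genres, key=PRIORITY.index)


print(primary_genre(["Animation", "Children"]))
print(primary_genre(["War", "Action", "Romance"]))
```

A genre name that is missing from `PRIORITY` raises a `ValueError`, which makes spelling differences in the input visible instead of silently dropping a movie.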

Our database schema is as follows. 

• User (userID, age, gender, occupation)

• Movie (movieID, releaseYear, genre)

• Rates (userID, movieID)

The GroupLens project discretizes the age of the user into the intervals
[0, 18), [18, 25), [25, 30), [30, 35), [35, 45), [45, 56), [56, ∞). We then use
the release decade instead of the release date for the movies, which range
from the 1920s to the 1990s. Together these give a manual discretization
setting that divides the numeric data into finer granules. This setting is
used as a baseline for comparison with the other discretization approaches.
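The manual setting amounts to a lookup against fixed cut points, sketched below in Python. The cut points are the interval boundaries listed above; the names `AGE_CUTS`, `age_interval` and `release_decade` are assumptions made for illustration.

```python
from bisect import bisect_right

# Inner boundaries of [0,18), [18,25), [25,30), [30,35), [35,45), [45,56), [56,∞).
AGE_CUTS = [18, 25, 30, 35, 45, 56]


def age_interval(age):
    """Index (0..6) of the left-closed, right-open age interval containing age."""
    return bisect_right(AGE_CUTS, age)


def release_decade(year):
    """Map a release year to the first year of its decade, e.g. 1995 to 1990."""
    return year // 10 * 10


print(age_interval(24), age_interval(25), age_interval(60))
print(release_decade(1995))
```

`bisect_right` places a boundary value such as 25 in the interval that starts at it, which matches the left-closed, right-open convention of the listed intervals.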

5.2. Results 

In this section, we try to answer the following questions through
experimentation.

1. Compared with the manual discretization setting, which approach performs
better, the Equal Width approach or the Equal Frequency approach?

2. Can we mine semantically richer rules through discretization?

3. Which settings of the interval numbers are suitable for the numeric data?

We undertake three sets of experiments to answer the questions one by one.

5.2.1. The performance of discretization approaches 

The discretization approaches are evaluated by the numbers of generated rules
and candidates. We compare the Equal Width approach, the Equal Frequency
approach, the manual discretization setting, and the primary data without
discretization. Let $mc = 0.15$, $tc = 0.17$ and
$ms = mt \in \{0.04, 0.06, 0.08, 0.10, 0.12\}$. Let $k$ be the number of
intervals. We set $k = 4$ and $k = 8$ for rule mining, respectively. We compare
the numbers of candidates and rules, as shown in Figures 1 and 2.

Figures 1 and 2 show that all discretization approaches help to mine more
candidates and rules from the discretized data than from the primary data, and
that the Equal Frequency approach mines the most. When $ms = mt = 0.12$, the
Equal Frequency approach can still mine rules, but the others cannot mine any.

We now compare the Equal Width approach with the manual discretization
setting. When $k = 4$, their numbers of candidates and rules differ
considerably, because one interval of one setting may be split into several
intervals of the other, which affects the number of generated rules. For
example, the Equal Width approach produces the interval [1979, 1998], which
covers both the 1980s and the 1990s. When $k = 8$, their numbers of candidates
and rules are very similar, because their intervals are very similar.

Figure 2: Number of rules: (a) interval number k = 4; (b) interval number k = 8.

5.2.2. Semantically richer rules

We obtain some strong rules using Equal Width and Equal Frequency. Here
we set the interval number $k = 4$, $ms = mt = 0.06$, $mc = 0.15$, and
$tc = 0.17$. The Equal Width and Equal Frequency approaches yield 43 and 68
granular association rules, respectively. We list four rules from each below.
The Equal Width approach:
(Rule 6) (age: [7, 24))
⇒ (genre: action)
(Rule 7) (age: [7, 24)) ∧ (gender: male)
⇒ (genre: action)
(Rule 8) (age: [7, 24)) ∧ (gender: male)
⇒ (releaseYear: [1979, 1998]) ∧ (genre: action)
(Rule 9) (age: [7, 24)) ∧ (gender: male) ∧ (occupation: student)
⇒ (releaseYear: [1979, 1998]) ∧ (genre: action)
The Equal Frequency approach:
(Rule 10) (age: [7, 25))
⇒ (genre: action)
(Rule 11) (age: [7, 25)) ∧ (occupation: student)
⇒ (genre: action)
(Rule 12) (age: [25, 31)) ∧ (gender: male)
⇒ (releaseYear: [1992, 1995]) ∧ (genre: comedy)
(Rule 13) (age: [7, 25)) ∧ (gender: male) ∧ (occupation: student)
⇒ (releaseYear: [1992, 1995]) ∧ (genre: comedy)

The rules obtained with both discretization approaches are meaningful, and
they might be applied to movie recommendation directly. Rule 6 indicates that
users whose ages are in [7, 24) rate action movies. Rules 7 and 8 are finer
than Rule 6 and hence semantically richer. Rule 9 is the semantically richest
of the four. Similarly, Rule 10 indicates that users whose ages are in [7, 25)
rate action movies, and Rule 11 is finer than Rule 10. Rule 12 concerns users
whose ages are in [25, 31) rather than [7, 25), and Rule 13 concerns comedy
rather than action; these rules are not comparable with Rule 11, but they are
still useful.

5.2.3. The setting of interval numbers

The setting of interval numbers is a key issue in discretization approaches, so
we compare different settings through experiments. We introduce two parameters
$k_1$ and $k_2$: $k_1$ is the number of intervals for the numeric data of User,
and $k_2$ is the number of intervals for the numeric data of Movie.

We set $ms = mt = 0.08$, $mc = 0.15$, and $tc = 0.17$. First, we let
$k_1 = 10$ and let $k_2$ increase from 2 to 30; the numbers of rules are
compared in Figure 3(a). Second, we let $k_2 = 11$ and let $k_1$ increase from
2 to 30; the numbers of rules are compared in Figure 3(b). Third, we let $k_1$
and $k_2$ each increase from 2 to 20 and record the corresponding number of
rules, which gives the three-dimensional plots shown in Figures 4 and 5.

Figure 3: Number of rules: (a) $k_1 = 10$; (b) $k_2 = 11$.

Figure 3(a) shows that the number of rules decreases as $k_2$ increases. The
reason is that more intervals mean lower coverage per interval, so some rules
no longer satisfy $mt$ and cannot be mined. The Equal Frequency approach mines
many more rules than the Equal Width approach at the beginning. This is because
the users and the movies are evenly distributed over the intervals produced by
the Equal Frequency approach, so more intervals satisfy $mt$ and more rules can
be mined. The number of rules slumps at $k_2 = 12$ for Equal Width and at
$k_2 = 13$ for Equal Frequency, because some rules no longer satisfy $mt$. For
example, when $k_2 = 12$ the number of candidates is 18 × 15, while when
$k_2 = 13$ it is only 18 × 3. Finally, the number of rules remains unchanged up
to $k_2 = 30$, because only these rules can still be mined.

Figure 3(b) also shows that the number of rules decreases as $k_1$ increases,
again because more intervals mean lower coverage per interval, so some rules no
longer satisfy $ms$ and cannot be mined. The Equal Frequency approach mines
many more rules when $k_1$ is between 2 and 12. For the Equal Frequency
approach, the number of rules slumps at $k_1 = 13$, because some rules no
longer satisfy $ms$. For instance, when $k_1 = 12$ the number of candidates is
19 × 14, while when $k_1 = 13$ it is only 8 × 14. Between $k_1 = 14$ and
$k_1 = 30$ the number of rules remains unchanged, because only these rules can
still be mined. For the Equal Width approach, the number of rules decreases
steadily as $k_1$ increases.

Figure 4: Number of rules obtained with different settings of interval numbers
through the Equal Width approach



Figures 4 and 5 show how the number of rules changes as $k_1$ and $k_2$
increase. The Equal Frequency approach mines more rules than Equal Width. In
Figure 4, we obtain more rules when $k_1$ ranges from 10 to 13 and $k_2$ ranges
from 9 to 11. In Figure 5, we obtain more rules when $k_1$ ranges from 8 to 10
and $k_2$ ranges from 10 to 12. Comparing the two figures, we observe that
Figure 5 is more intuitive than Figure 4.

5.3. Discussions 

Now we can answer the questions proposed at the beginning of this section. 

1. Discretization is an effective preprocessing technique for mining stronger
rules, so it outperforms the primary data. Compared with the manual
discretization setting and the Equal Width approach, the Equal Frequency
approach generates more candidates and stronger rules.

2. Through discretization, we can obtain semantically much richer rules.

3. With the Equal Frequency approach, letting $k_1$ range from 8 to 10 and
$k_2$ range from 10 to 12 is a favorable setting of the interval numbers.

6. Conclusions and further works 

In this paper, we evaluated and compared discretization approaches for
granular association rule mining. With the help of discretization, we mined
semantically richer and stronger rules. The Equal Frequency approach helped
generate more rules than the Equal Width approach. For each approach, we
identified settings of the interval numbers that yield many more rules.

Figure 5: Number of rules obtained with different settings of interval numbers
through the Equal Frequency approach

The following research topics deserve further investigation:

1. Preferable discretization approaches. In this work we adopt the Equal
Width approach and the Equal Frequency approach. In fact, there are many
other discretization approaches. Many approaches such as rough sets and
decision trees work better on discretized data [30, 31]. We will try to
choose some suitable discretization approaches, and design a more
appropriate one for granular association rule mining.

2. Intelligent choice. In practice, a data set may contain numeric data for
several different attributes, and we currently apply the same discretization
approach to all of them. However, different algorithms suit different data,
so we will try to combine different algorithms to choose the discretization
intelligently for each attribute of the same data set. Such a scheme would
be more valuable in practical applications.